Use case example¶
Terminal access¶
In Linux, you can open a terminal with Ctrl + Alt + T.
In Windows, you can search for cmd in the Start menu and open the Command Prompt, or you can open PowerShell by pressing Shift + right-click on the desktop or in the desired folder.
Generate an SSH key pair¶
In order to access the DGX, you need an SSH key. To generate it, you can use ssh-keygen in a terminal. You can then find the public key in ~/.ssh/filename.pub (Linux/macOS) or in C:\Users\USERNAME\.ssh\filename.pub (Windows).
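For example, a minimal key generation on Linux/macOS could look like this (the key type and file name are only suggestions, not DGX requirements):
# Generate an Ed25519 key pair stored as ~/.ssh/dgx_key and ~/.ssh/dgx_key.pub
ssh-keygen -t ed25519 -f ~/.ssh/dgx_key
# Display the public key so you can copy it where requested
cat ~/.ssh/dgx_key.pub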
Connection¶
Once your account has been created, you can access the DGX using SSH.
Using an SSH client on Linux/macOS¶
By default, you can use the following command:
user@mycomputer:~$ ssh username@hubia-dgx.centralesupelec.fr
You can also edit your ~/.ssh/config file by adding, for example:
Host dgx
HostName hubia-dgx.centralesupelec.fr
User username
IdentityFile ~/.ssh/private_key_name
which will allow you to directly use the command ssh dgx.
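With this configuration in place, connecting is as simple as:
user@mycomputer:~$ ssh dgx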
Using an SSH client on Windows (PuTTY)¶
To access the DGX from a Windows machine with an SSH client, you can use PuTTY.
When asked to configure your connection, you have to fill the "Host Name (or IP address)" field with "hubia-dgx.centralesupelec.fr" and ensure that the "connection type" is set to "SSH".
When PuTTY is configured correctly, you just have to click the "Open" button and refer to the previous section in order to log in.
Network restrictions¶
Please note that the DGX can only be accessed from the Eduroam network on the Paris-Saclay campus or via VPN.
On the DGX¶
Once the SSH connection is established, congratulations, you are now connected to the DGX!
From there, you can ask for an interactive session or launch a batch job (see the page on slurm jobs management for more information on the different options, or the examples below).
Using VSCode¶
The DGX is a server, and you don't have access to a graphical interface. However, now that the configuration is done, you can launch VSCode on your machine and connect to the DGX using the Remote-SSH extension.
Once this extension has been installed, you can click on the icon at the bottom left of VSCode and select Remote-SSH: Connect to Host.... You can then select dgx from the list of hosts and connect to the DGX. You will then be able to open your user folder.
Some users have encountered problems with the VSCode Remote-SSH extension. To resolve them, they had to perform the following operations:
* open the command palette with Ctrl+Shift+P;
* run Remote SSH: Uninstall VS Code Server from Host;
* then run Remote SSH: Connect current window to host.
More information on VSCode key-based authentication can be found in this guide.
Using a Python virtual environment¶
The easiest way to manage your libraries is to create your own virtual environment. You can make one for all your projects, or one per project to avoid conflicts. The important things are to avoid installing libraries directly in the global environment and to remember to activate your environment before launching your code.
To create a Python virtual environment, simply type python -m venv your_env_name in your terminal, from the directory where you want to create it. You can then activate it by typing source ./your_env_name/bin/activate. You can now install your libraries with pip install some_cool_library_I_need.
At the end of your session, you can close your venv with the command deactivate.
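Putting these commands together, a typical session could look like this (the environment and library names are placeholders):
# Create the virtual environment in the current directory
python -m venv my_env
# Activate it (your prompt should now show the environment name)
source ./my_env/bin/activate
# Install the libraries you need
pip install numpy
# ... run your code ...
# Leave the environment when you are done
deactivate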
If you are using VSCode, in order to activate your venv, you first need to open the project and then click on the environment name in the bottom right corner of VSCode. You can then select your environment manually by going to your environment folder, then bin, and selecting python.
Using an interactive session¶
With an interactive session, you can write your code and test it on small datasets running on the MIGs (Multi-Instance GPUs):
* the partition must be interactive10;
* the reserved MIG must be a 1g.10gb;
* the total CPUs requested (ntasks * cpus-per-task) must not exceed 4 CPUs;
* example for a one hour interactive session:
srun --partition=interactive10 --gres=gpu:1g.10gb:1 --ntasks=1 --cpus-per-task=4 --time=1:00:00 --pty bash
The max walltime, which is also the default, is two hours. Once your session starts (it can take some time if the MIGs are already in use), you can activate your virtual environment and start working on your code and your tests!
At any time, you can close the session with exit, which will also end the job and free the MIG for other users.
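For example, once the interactive shell has started (the environment and script names below are placeholders):
source ~/my_env/bin/activate   # activate your virtual environment
python my_test.py              # quick test on a small dataset
exit                           # end the session and free the MIG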
Using a batch job¶
Now that your code is ready to run on bigger datasets, you want to use the MIG for longer computing, and/or use bigger MIGs. For that, you will use the batch job.
Writing the job script¶
Suppose you want to execute a main.py file. Here is a fairly general template of a script job.batch which runs a main.py file, including all mandatory directives (partition, gres, ntasks and cpus-per-task):
#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output /path/to/slurm-%j.out
#SBATCH --error /path/to/slurm-%j.err
## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10
## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1
## For ntasks and cpus: total requested cpus (ntasks * cpus-per-task) must be between 1 and 4 * nMIG, with nMIG = VRAM / 10 (hence nMIG = 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Perform run
python3 /path/to/main.py
Another one using a virtual environment, a logslurm directory (for the output and error files) and a working_directory (containing the main.py) in the user home:
#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output=~/logslurm/slurm-%j.out
#SBATCH --error=~/logslurm/slurm-%j.err
## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10
## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1
## For ntasks and cpus: total requested cpus (ntasks * cpus-per-task) must be between 1 and 4 * nMIG, with nMIG = VRAM / 10 (hence nMIG = 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Virtual environment
source ~/env/bin/activate
## Perform run
CUDA_VISIBLE_DEVICES=1 time python ~/working_directory/main.py
In both examples, the standard output (stdout) will be in the slurm-%j.out file (the %j will be replaced automatically by the job ID) and the standard error (stderr) will be in the slurm-%j.err file.
Please note that the directories you specify for the output and the error files must already exist.
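For instance, the logslurm directory used in the second example can be created once with:
mkdir -p ~/logslurm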
Submitting the job script¶
You need to submit your script job.batch with:
$ sbatch /path/to/job.batch
Submitted batch job 29509
which responds with the JobID assigned to the job (here, 29509). The JobID is a unique identifier used by many Slurm commands.
Monitoring the job¶
The squeue command shows the list of jobs:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29509 prod10 job username R 0:02 1 dgxa100
You can change the default format through the SQUEUE_FORMAT environment variable, for example by adding the following to your .bash_profile:
export SQUEUE_FORMAT="%.18i %.14P %.8j %.10u %.2t %.10M %20b %R"
which replaces the NODES column (always 1, since there is only the DGX) with the MIG requested by the job (TRES_PER_NODE column):
JOBID PARTITION NAME USER ST TIME TRES_PER_NODE NODELIST(REASON)
For more squeue format options, see the squeue documentation (man squeue).
If your job is pending for a priority reason, you can get more information about it with the sprio command. Maybe the priority is given to more occasional users (fairshare), maybe the other jobs are asking for less time (you can change the time requested for your job with the --time flag), or maybe there are simply too many jobs at the moment. But don't worry, given enough time everyone will have their jobs completed!
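For example, to inspect the priority factors of the pending job from the example above (29509 is just the example JobID):
sprio -j 29509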
Canceling the job¶
If you need to cancel your job, you can use the scancel command.
To cancel your job with jobid 29509 (obtained when submitting or through squeue), you would use:
$ scancel 29509
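If you want to cancel all of your jobs at once, scancel also accepts a user filter (replace username with your own user name):
$ scancel -u username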
Wrapping it up¶
Now you should have all the tools to start your computations on the DGX! If you haven't already done so, you can explore the rest of the documentation, in particular the available partitions or the page on slurm jobs management.
Finally, don't hesitate to contact us at dgx_support@listes.centralesupelec.fr if you have any questions about the problems you're experiencing or if you'd like to suggest additions to this documentation.
Happy computing!